Skip to content

feat(pool-manager): rework stuck-runner cleanup to eliminate OOMs #5

Merged
elasticroentgen merged 1 commit into
mainfrom
poolmanager-oom
Jul 21, 2025
Merged

feat(pool-manager): rework stuck-runner cleanup to eliminate OOMs #5
elasticroentgen merged 1 commit into
mainfrom
poolmanager-oom

Conversation

@gcordalis
Copy link
Copy Markdown
Contributor

Why

  • A production incident (System.OutOfMemoryException during CheckForStuckRunners) showed that we were loading all runners and their full Lifecycle collections into memory, then letting EF Core’s change-tracker churn on tens of thousands of entities.
  • Each cleanup cycle recreated that pressure, eventually crashing the autoscaler pod and stopping the host.

What

  • Query trimmed to SQL only
    • Replaced Include(...).AsEnumerable().Where(...) with a pure LINQ-to-Entities filter: LastState == Created && CreatedTime < now-10min.
    • Removed unconditional Include(x => x.Lifecycle).
    • Added AsNoTracking() and a projection to an anonymous type (Select(r ⇒ new { … })) so no full Runner entities are tracked.
  • Context lifetime & tracking
    • Scoped ActionsRunnerContext with await using – guarantees disposal after the method exits.
    • Disabled auto-detection of changes only for this batch (ChangeTracker.AutoDetectChangesEnabled = false) to minimise change-tracker work.
  • Batch insert of lifecycle events
    • Accumulate new RunnerLifecycle rows in an in-memory list and call AddRange once instead of adding per runner.
    • Single SaveChangesAsync() at the end → one DB round-trip.
  • Queue deletion without materialising collections
    • For every stuck runner enqueue a DeleteRunnerTask directly using the projected key data (no need for full entity).
  • Logging
    • Added explicit warning log per stuck runner and kept the original message structure for easy grepping.

Result

  • CheckForStuckRunners now pulls only the small “actually stuck” set into memory and never tracks existing Lifecycle rows.

IN CLAUDE WE TRUST

… cut DB load

### Why
* A production incident (`System.OutOfMemoryException` during `CheckForStuckRunners`) showed that we were
  loading **all** runners and their full `Lifecycle` collections into memory, then letting EF Core’s
  change-tracker churn on tens of thousands of entities.
* Each cleanup cycle recreated that pressure, eventually crashing the autoscaler pod and stopping the host.

### What
* **Query trimmed to SQL only**
  * Replaced `Include(...).AsEnumerable().Where(...)` with a *pure* LINQ-to-Entities
    filter:
    `LastState == Created && CreatedTime < now-10min`.
  * Removed unconditional `Include(x => x.Lifecycle)`.
  * Added `AsNoTracking()` and a **projection** to an anonymous type
    (`Select(r ⇒ new { … })`) so no full `Runner` entities are tracked.
* **Context lifetime & tracking**
  * Scoped `ActionsRunnerContext` with `await using` – guarantees disposal after
    the method exits.
  * Disabled auto-detection of changes only for this batch
    (`ChangeTracker.AutoDetectChangesEnabled = false`) to minimise
    change-tracker work.
* **Batch insert of lifecycle events**
  * Accumulate new `RunnerLifecycle` rows in an in-memory list and call
    `AddRange` once instead of adding per runner.
  * Single `SaveChangesAsync()` at the end → one DB round-trip.
* **Queue deletion without materialising collections**
  * For every stuck runner enqueue a `DeleteRunnerTask` directly using the
    projected key data (no need for full entity).
* **Logging**
  * Added explicit warning log per stuck runner **and** kept the original message
    structure for easy grepping.

### Result
* `CheckForStuckRunners` now pulls only the small “actually stuck” set into
  memory and never tracks existing `Lifecycle` rows.

**IN CLAUDE WE TRUST**
@gcordalis
Copy link
Copy Markdown
Contributor Author

│ [21:44:00 INF] Cleaning runners...                                                                                                                                                │
│ [21:44:10 ERR] BackgroundService failed                                                                                                                                           │
│ System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.                                                                                          │
│    at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.ChangeDetector.LocalDetectChanges(InternalEntityEntry entry)                                                          │
│    at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.ChangeDetector.DetectChanges(IStateManager stateManager)                                                              │
│    at Microsoft.EntityFrameworkCore.ChangeTracking.ChangeTracker.DetectChanges()                                                                                                  │
│    at Microsoft.EntityFrameworkCore.DbContext.TryDetectChanges()                                                                                                                  │
│    at Microsoft.EntityFrameworkCore.DbContext.SaveChangesAsync(Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken)                                            │
│    at GithubActionsOrchestrator.PoolManager.CheckForStuckRunners(List`1 targetConfig) in /src/PoolManager.cs:line 176                                                             │
│    at GithubActionsOrchestrator.PoolManager.ExecuteAsync(CancellationToken stoppingToken) in /src/PoolManager.cs:line 90                                                          │
│    at Microsoft.Extensions.Hosting.Internal.Host.TryExecuteBackgroundServiceAsync(BackgroundService backgroundService)                                                            │
│ [21:44:10 FTL] The HostOptions.BackgroundServiceExceptionBehavior is configured to StopHost. A BackgroundService has thrown an unhandled exception, and the IHost instance is sto │
│ System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.                                                                                          │
│    at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.ChangeDetector.LocalDetectChanges(InternalEntityEntry entry)                                                          │
│    at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.ChangeDetector.DetectChanges(IStateManager stateManager)                                                              │
│    at Microsoft.EntityFrameworkCore.ChangeTracking.ChangeTracker.DetectChanges()                                                                                                  │
│    at Microsoft.EntityFrameworkCore.DbContext.TryDetectChanges()                                                                                                                  │
│    at Microsoft.EntityFrameworkCore.DbContext.SaveChangesAsync(Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken)                                            │
│    at GithubActionsOrchestrator.PoolManager.CheckForStuckRunners(List`1 targetConfig) in /src/PoolManager.cs:line 176                                                             │
│    at GithubActionsOrchestrator.PoolManager.ExecuteAsync(CancellationToken stoppingToken) in /src/PoolManager.cs:line 90                                                          │
│    at Microsoft.Extensions.Hosting.Internal.Host.TryExecuteBackgroundServiceAsync(BackgroundService backgroundService)                                                            │
│ Stream closed EOF for github-runner-manager/ef-github-runner-manager-github-autoscaler-5f7f945778-jctbm (github-autoscaler)    

Noticed this and saw the pod and restarted many many times. I can't do .NET but thought I'd let Claude have a crack and let you know :)

@elasticroentgen elasticroentgen merged commit 5b57ac9 into main Jul 21, 2025
1 check failed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants